8 research outputs found
Object Referring in Visual Scene with Spoken Language
Object referring has important applications, especially for human-machine
interaction. While it has received great attention, the task is mainly addressed
with written language (text) as input rather than spoken language (speech),
which is more natural. This paper investigates Object Referring with Spoken
Language (ORSpoken) by presenting two datasets and one novel approach. Objects
are annotated with their locations in images, text descriptions and speech
descriptions. This makes the datasets ideal for multi-modality learning. The
approach is developed by carefully decomposing the ORSpoken problem into three
sub-problems and introducing task-specific vision-language interactions at the
corresponding levels. Experiments show that our method outperforms competing
methods consistently and significantly. The approach is also evaluated in the
presence of audio noise, showing the efficacy of the proposed vision-language
interaction methods in counteracting background noise. Comment: 10 pages, Submitted to WACV 201
Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR), i.e., localizing a target
object in a visual scene that comes with a language description. Humans perceive
the world more as continued video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion features. Humans also gaze at the object when they issue a referring
expression. Existing works for OR mostly focus on static images only, which
fall short in providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30,000 objects over 5,000 stereo video sequences annotated for their
descriptions and gaze. We further propose a novel network model for OR in
videos, by integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context. Our method
outperforms previous OR methods. For the dataset and code, please refer to https://people.ee.ethz.ch/~arunv/ORGaze.html. Comment: Accepted to CVPR 2018, 10 pages, 6 figures
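The abstract describes integrating appearance, motion, gaze, and spatio-temporal context into one network. Purely as an illustration of the fusion idea (the actual model learns the combination end-to-end; the candidate scores and weights below are made up), a late-fusion scorer over candidate objects might look like:

```python
import numpy as np

def fuse_modalities(appearance, motion, gaze, context, weights=(0.4, 0.2, 0.2, 0.2)):
    """Late-fusion sketch: combine per-candidate scores from four cues
    into a single referring score and pick the best candidate object."""
    scores = (weights[0] * appearance + weights[1] * motion
              + weights[2] * gaze + weights[3] * context)
    return int(np.argmax(scores))

# Toy example: 3 candidate objects scored by each cue (values in [0, 1]).
appearance = np.array([0.2, 0.9, 0.4])
motion     = np.array([0.1, 0.8, 0.3])
gaze       = np.array([0.3, 0.7, 0.9])
context    = np.array([0.5, 0.6, 0.2])
best = fuse_modalities(appearance, motion, gaze, context)  # candidate index 1 wins
```

The weighted sum is only a stand-in for the learned interactions; it shows why a candidate that is strong across all cues beats one that is strong in a single cue.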
Talk2Nav: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory
The role of robots in society keeps expanding, bringing with it the necessity
of interacting and communicating with humans. In order to keep such interaction
intuitive, we provide automatic wayfinding based on verbal navigational
instructions. Our first contribution is the creation of a large-scale dataset
with verbal navigation instructions. To this end, we have developed an
interactive visual navigation environment based on Google Street View; we
further design an annotation method to highlight mined anchor landmarks and
local directions between them in order to help annotators formulate typical,
human-like references to them. The annotation task was crowdsourced on the AMT
platform, to construct a new Talk2Nav dataset with routes. Our second
contribution is a new learning method. Inspired by spatial cognition research
on the mental conceptualization of navigational instructions, we introduce a
soft dual attention mechanism defined over the segmented language instructions
to jointly extract two partial instructions -- one for matching the next
upcoming visual landmark and the other for matching the local directions to the
next landmark. Along similar lines, we also introduce a spatial memory scheme to
encode the local directional transitions. Our work takes advantage of the
advance in two lines of research: mental formalization of verbal navigational
instructions and training neural network agents for automatic wayfinding.
Extensive experiments show that our method significantly outperforms previous
navigation methods. For demo video, dataset and code, please refer to our
project page: https://www.trace.ethz.ch/publications/2019/talk2nav/index.html. Comment: 20 pages, 10 figures, Demo Video: https://people.ee.ethz.ch/~arunv/resources/talk2nav.mp
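The soft dual attention idea, two queries attending over the same segmented instruction to extract a landmark part and a directions part, can be sketched as follows. This is illustrative only: `q_landmark` and `q_direction` stand in for learned query vectors, and the real model operates on learned language encodings rather than random embeddings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_attention(segments, q_landmark, q_direction):
    """Soft dual attention sketch: two query vectors attend over the same
    segmented-instruction embeddings, yielding one summary vector for the
    next visual landmark and one for the local directions toward it."""
    a_lm = softmax(segments @ q_landmark)     # attention weights over segments
    a_dir = softmax(segments @ q_direction)
    # Each output is a weighted average of segment embeddings.
    return a_lm @ segments, a_dir @ segments

rng = np.random.default_rng(0)
segments = rng.normal(size=(5, 8))   # 5 instruction segments, 8-dim embeddings
lm_vec, dir_vec = dual_attention(segments, rng.normal(size=8), rng.normal(size=8))
```

The two attention distributions are computed over the same segments but with different queries, which is what lets the model split one instruction into two partial instructions.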
Multimodal Semantic Understanding and Navigation in Outdoor Scenes
From indoor robotics to automated cars, there has been tremendous growth in the number of robots in our day-to-day life. For instance, products such as smart speakers, wearable technologies, home robots, and self-driving cars are already here, and many more smart assistants are expected in the next few years. These robotic systems interact with humans and their surrounding environments to perform their designated tasks. Research on robotic perception, visual language navigation, speech recognition, and related topics drives the aforementioned applications, and there has been significant progress in the past decade.
The focus of this thesis is to develop models that tackle some of these challenges and enable better robot perception and navigation systems. A perception system must handle a multitude of tasks, such as understanding human cues and visually perceiving the environment. To this end, we propose an approach to the Object Referring (OR) task using spoken language, human gaze, and natural language text. We train and evaluate our method on the Cityscapes dataset, augmented with human gaze and speech captured in an indoor setup. We observe that language-guided OR performance improves with the addition of the human-side gaze and speech modalities and with the visual scene modalities of RGB, depth, and motion.
Next, the thesis focuses on the challenge of robot navigation. The vast majority of research targets indoor or simulated outdoor navigation. Here, we define the problem of language-based robot navigation in a real outdoor environment, where the agent uses a first-person view to understand and execute natural language instructions. We create a large-scale dataset with verbal navigation instructions based on Google Street View. Experiments on our dataset show that the proposed approach aids language-guided automatic wayfinding.
Finally, what happens to a robot's visual perception system when it encounters poor lighting conditions or camera malfunction? Robots can then hear the environment to perceive it, as humans do. There are few works in the literature on sound perception in outdoor environments. We develop an approach for dense semantic object labelling based on binaural sounds from the environment. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight binaural microphones and a 360° camera. We also propose two auxiliary tasks, namely, a) a novel task of Spatial Sound Super-resolution, and b) dense depth prediction of the scene. We then formulate the three tasks in one end-to-end multi-tasking network, and the evaluation on our dataset shows that all three tasks are mutually beneficial.
Motion Characterization of a Dynamic Scene
Given a video, there are many algorithms to separate the static and dynamic objects present in the scene. The proposed work focuses on further classifying the dynamic objects as having either repetitive or non-repetitive motion. We propose a novel approach to this challenging task by processing the optical flow fields corresponding to the video frames of a dynamic natural scene. We design an unsupervised learning algorithm which uses functions of the flow vectors to construct the feature vector. The proposed algorithm is shown to be effective in classifying a scene into static, repetitive, and non-repetitive regions. The proposed approach finds significance in various vision and computational photography tasks such as video editing, video synopsis, and motion magnification. By Arun Balajee Vasudevan, Srikanth Muralidharan, Shiva Pratheek Chintapalli and Shanmuganathan Rama
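To illustrate the kind of flow-based reasoning involved (not the authors' actual algorithm; the thresholds and the autocorrelation test are invented for this sketch), a toy classifier of a region's flow time series into static / repetitive / non-repetitive could be:

```python
import numpy as np

def classify_region(flow_x, mag_thresh=0.1, periodicity_thresh=0.5):
    """Toy classifier over one region's horizontal-flow time series:
    'static' if there is almost no motion, 'repetitive' if the signal is
    strongly self-similar at some lag, else 'non-repetitive'.
    A hand-rolled stand-in for the paper's unsupervised algorithm."""
    flow_x = np.asarray(flow_x, dtype=float)
    if np.abs(flow_x).mean() < mag_thresh:
        return "static"
    f = flow_x - flow_x.mean()
    denom = (f * f).sum()
    # Normalized autocorrelation at lags 1 .. T//2 - 1.
    acf = [(f[:-k] * f[k:]).sum() / denom for k in range(1, len(f) // 2)]
    return "repetitive" if max(acf) > periodicity_thresh else "non-repetitive"

t = np.arange(64)
print(classify_region(np.zeros(64)))      # prints "static"
print(classify_region(np.sin(0.5 * t)))   # prints "repetitive"
```

A periodic flow signal correlates strongly with a shifted copy of itself, which is what the autocorrelation test detects; truly aperiodic motion decorrelates quickly.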
A novel approach to the extraction of multiple salient objects in an image
by Srikanth Muralidharan, Arun Balajee Vasudevan, Shiva Pratheek Chintapalli and Shanmuganathan Rama
Dynamic scene classification using spatial and temporal cues
A real world scene may contain several objects with different spatial and temporal characteristics. This paper proposes a novel method for the classification of natural scenes
by processing both spatial and temporal information from the video. For extracting the spatial characteristics, we build spatial pyramids using the spatial pyramid matching (SPM) algorithm on SIFT descriptors, while for the motion characteristics, we introduce a five-dimensional feature vector extracted from the optical flow field. We employ SPM on combined SIFT and motion feature descriptors to perform classification. We demonstrate that the proposed approach shows significant improvement in scene classification as compared to the SPM algorithm on SIFT spatial feature descriptors alone. By A. B. Vasudevan, S. Muralidharan, S. P. Chintapalli and Shanmuganathan Rama
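The abstract mentions a five-dimensional motion feature extracted from the optical flow field without listing its components. Purely as an illustration (the actual five dimensions are not specified here), one plausible hand-picked five-dimensional descriptor is:

```python
import numpy as np

def motion_descriptor(u, v):
    """Illustrative 5-D motion feature from a dense flow field (u, v):
    mean and std of flow magnitude, cos/sin of the dominant flow
    direction, and the fraction of moving pixels. A guessed stand-in,
    not the paper's actual feature definition."""
    mag = np.hypot(u, v)
    ang = np.arctan2(v.sum(), u.sum())  # direction of the mean flow vector
    return np.array([mag.mean(), mag.std(),
                     np.cos(ang), np.sin(ang),
                     (mag > 0.5).mean()])

# Uniform rightward flow: every pixel moves with (1, 0).
u = np.ones((4, 4)); v = np.zeros((4, 4))
feat = motion_descriptor(u, v)  # -> [1, 0, 1, 0, 1]
```

Encoding the direction as (cos, sin) rather than a raw angle avoids the wrap-around discontinuity at ±π, which matters when such features are later compared with Euclidean distances inside SPM.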
Query-adaptive Video Summarization via Quality-aware Relevance Estimation
© 2017 Copyright held by the owner/author(s). Although the problem of automatic video summarization has recently received a lot of attention, the problem of creating a video summary that also highlights elements relevant to a search query has been less studied. We address this problem by posing query-relevant summarization as a video frame subset selection problem, which lets us optimise for summaries which are simultaneously diverse, representative of the entire video, and relevant to a text query. We quantify relevance by measuring the distance between frames and queries in a common textual-visual semantic embedding space induced by a neural network. In addition, we extend the model to capture query-independent properties, such as frame quality. We compare our method against the previous state of the art on textual-visual embeddings for thumbnail selection and show that our model outperforms them on relevance prediction. Furthermore, we introduce a new dataset, annotated with diversity and query-specific relevance labels. On this dataset, we train and test our complete model for video summarization and show that it outperforms standard baselines such as Maximal Marginal Relevance. Submitted to ACM Multimedia 2017. Status: published
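Since the method frames summarization as subset selection balancing relevance and diversity, and compares against Maximal Marginal Relevance, an MMR-style greedy selector over embedding similarities can illustrate the core idea. This is a sketch only: the paper's objective also covers representativeness and frame quality, and the embeddings here are random placeholders.

```python
import numpy as np

def mmr_select(frame_emb, query_emb, k=3, lam=0.7):
    """MMR-style greedy frame selection in a shared embedding space:
    each step adds the frame maximizing lam * relevance(query) minus
    (1 - lam) * max similarity to the already-selected frames."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    rel = np.array([cos(f, query_emb) for f in frame_emb])
    selected = []
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(len(frame_emb)):
            if i in selected:
                continue
            # Redundancy: similarity to the closest already-chosen frame.
            red = max((cos(frame_emb[i], frame_emb[j]) for j in selected),
                      default=0.0)
            score = lam * rel[i] - (1 - lam) * red
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

rng = np.random.default_rng(1)
frames = rng.normal(size=(10, 16))   # 10 frame embeddings, 16-dim
query = rng.normal(size=16)
summary = mmr_select(frames, query, k=3)
```

The `lam` parameter trades off query relevance against redundancy; `lam = 1` reduces to picking the top-k most relevant frames regardless of how similar they are to each other.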